Statistical Graphics for High-Dimensional Data

Susan VanderPlas

Statistics Department


Iowa State University

Motivation

Outline


  • Statistical Graphics
  • Big Data Challenges and Graphics
  • Case Study: Designing Interactive Graphics for Soybean Population Genetics
  • Next Steps

Statistical Graphics

Good Statistical Graphics

Function:

  • Show the data
  • Don’t distort the data

Form:

  • Show a consistent story
  • Provide several levels of detail (ideally)

Elegance:
How do I best communicate the data?

  • Perceptual Awareness
  • Visual Bandwidth (information overload)

Statistical Graphics Literature

History of Graph Criticism:

Examples of Good Graphics:

Guidelines for Creating Good Graphs:

User Studies and Experiments:

Dissertation

Which Plot has Homogeneous Variance?

Sine Illusion

Sine Illusion - Explained

Perception is optimized for three dimensions.
Our brains sometimes inappropriately apply 3D heuristics to 2D images, producing optical illusions.

Sine Illusion

  • Case Study: DW, an individual who lacks binocular depth perception, is immune to the illusion

  • Subconscious (can’t be “un-seen”)

  • Affects perception of variability or height
    • Candlestick plots (finance)
    • Time Series
    • Scatter plots (nonlinear trend)
    • Streamgraphs
    • Stacked area plots



Solution 1: Transform X


From: Signs of the Sine Illusion: Why we need to care (JCGS, 2015)

\((f \circ T)(x) = a + (b-a)\left(\int_{a}^x |f^\prime(z)| dz\right)/\left(\int_{a}^{b}|f^\prime(z)| dz\right)\)

\((f \circ T_w)(x) = (1-w) \cdot x + w \cdot (f \circ T)(x)\)
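
The x-transformation can be sketched numerically. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name transform_x, the trapezoid-rule integration, and the grid size are choices made here, and f(x) = sin(x) on [0, 2π] is only an example curve.

```python
import numpy as np


def transform_x(x, fprime, a, b, w=1.0, n_grid=2001):
    """Weighted x-axis correction on [a, b].

    w = 0 returns x unchanged; w = 1 applies the full correction.
    (Hypothetical helper: name and numerical scheme are assumptions.)
    """
    x = np.asarray(x, dtype=float)
    grid = np.linspace(a, b, n_grid)
    abs_slope = np.abs(fprime(grid))
    # Trapezoid rule: running integral of |f'(z)| from a to each grid point.
    steps = 0.5 * (abs_slope[1:] + abs_slope[:-1]) * np.diff(grid)
    cumulative = np.concatenate(([0.0], np.cumsum(steps)))
    total = cumulative[-1]
    if total == 0.0:
        raise ValueError("f is constant on [a, b]; no correction is defined")
    # Full correction: rescale the accumulated |slope| back onto [a, b].
    full = a + (b - a) * np.interp(x, grid, cumulative) / total
    # Weighted correction: blend the identity with the full correction.
    return (1.0 - w) * x + w * full


a, b = 0.0, 2.0 * np.pi
x = np.linspace(a, b, 9)
x_full = transform_x(x, np.cos, a, b, w=1.0)  # f(x) = sin(x), f'(x) = cos(x)
x_half = transform_x(x, np.cos, a, b, w=0.5)
```

The endpoints a and b stay fixed and the mapping is monotone, so the transformed axis stretches regions where the curve is steep and compresses regions where it is flat.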

Solution 2: Transform Y


From: Signs of the Sine Illusion: Why we need to care (JCGS, 2015)

\(l_{new}(x_0) = l_{old} \sqrt{1 + f^\prime(x_0)^2}\)

\(l_{new_w}(x) = (1-w) \cdot l_{old} + w \cdot l_{new}(x)\)
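
The y-transformation is simple enough to sketch directly from the two formulas. The function names corrected_length and segment_endpoints are hypothetical, and centering each vertical segment on f(x) is an assumption made for illustration.

```python
import numpy as np


def corrected_length(x, fprime, l_old, w=1.0):
    """Segment length after the weighted y-correction (hypothetical helper)."""
    slope = fprime(np.asarray(x, dtype=float))
    # Full correction: stretch the segment where the curve is steep.
    l_full = l_old * np.sqrt(1.0 + slope ** 2)
    # Weighted correction: blend the original and fully corrected lengths.
    return (1.0 - w) * l_old + w * l_full


def segment_endpoints(x, f, fprime, l_old, w=1.0):
    """Lower and upper ends of vertical segments centered on f(x) (assumption)."""
    length = corrected_length(x, fprime, l_old, w)
    center = f(np.asarray(x, dtype=float))
    return center - length / 2.0, center + length / 2.0


x = np.linspace(0.0, 2.0 * np.pi, 41)
lengths = corrected_length(x, np.cos, l_old=1.0, w=1.0)  # f(x) = sin(x)
lower, upper = segment_endpoints(x, np.sin, np.cos, l_old=1.0, w=0.5)
```

With f(x) = sin(x), segments keep their original length at the peaks and troughs (slope 0) and grow by a factor of √2 where the slope is ±1.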

Experimental Validation


  • Corrections validated experimentally
  • Amazon Mechanical Turk
  • 206 participants completed 1374 trials in 4 days
  • Goals:
    • Identify range of acceptable weight values
    • Examine whether weight values are specific to individuals or consistent within the population

Results

Findings

  • Sine Illusion affects our perceptions of statistical graphics

  • Corrections are effective at removing the illusion’s effects

  • Partial correction is still effective

  • “Don’t distort the data”: We need to be concerned with psychological distortion

Big Data: Challenges and Graphics

Big Data

Visualization is an important tool for working with big data

Adaptations must be made:

  • Overplotting (large \(n\))
  • High-dimensional data (large \(p\))
  • Distributed/multi-source data, hierarchical data
  • No solution (binning, dimension reduction, tours) works for every situation
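
As one illustration of the binning adaptation for large n, the sketch below replaces a scatterplot of simulated points with a grid of counts. The data, the number of points, and the bin counts are all made up for the example; as the list above notes, binning will not suit every situation.

```python
import numpy as np

# Simulated data standing in for a large-n scatterplot (values are made up).
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.5, size=n)

# Rectangular binning: one count per tile instead of one glyph per point.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(40, 30))

# Only the non-empty tiles need to be drawn.
n_tiles = int(np.count_nonzero(counts))
```

Drawing the tiles with color mapped to count shows the density structure that overplotted points would hide, and the plotting cost depends on the number of tiles rather than on n.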

Interactive Graphics

  • Provide additional information in response to user action

  • Simultaneously show more than 2-3 variables and their relationship (multiple linked plots)

  • Accommodate complex data structures

BUT…


Web-based interactive graphics may be even more size-sensitive than static graphics.

Interactive Visualization

  • Lacks the rigor of a grammar of interactivity

  • Design is a function of necessity (for now!), which can lead to sub-optimal graphics
    • Interactivity vs. Animation vs. Static Plots
    • Many types of interactivity, with different use cases:
      Brushing, linked plots, subsetting, zoom-and-filter

  • Perceptual research is limited
    • Extremely specific use cases
    • Low-level psychological effects
    • Testing paradigms are somewhat difficult

Interactive Visualization of Soybean Population Genetic Data

Soybean Project: People and Institutions

Overall Project Goals:

  • Understand historical yield increases
    100% increase in the past 100 years; an additional 70% increase is needed by 2050 to meet food needs (World Bank)
  • Associate genetic features with phenotypic traits
    Disease resistance, yield, nutritional content, time to maturity

  • Communicate analysis results intuitively:
    • Target: Soybean farmers, plant geneticists
    • Provide full results (tables) and graphical summaries
    • Interface with existing databases and web resources

Data


  • Sequencing Data (79 varieties, 75GB processed and compressed)

  • Field Trials (168 varieties, 30 varieties with genetic data)

  • New crosses with highest yield varieties
    (sequencing + field trials)

  • Genealogy (1600 varieties)

Visualization Concerns:

  • Huge number of interesting genes (70 million ID’d SNPs)
  • 79 varieties, 20 chromosomes
  • Phenotype and genealogy information
  • Researchers tend to work on gene subsets:
    Must be able to zoom and filter
  • Optimized files for SNP results are still large (10 GB) and require significant computational resources

Visualizing SNPs

  • SNP: Single Nucleotide Polymorphism, a single base-pair mutation
    (e.g., A -> T, G -> A, C -> G)
  • Shiny applet: Responsive applet for user-directed data subsets
  • Show multiple levels of detail (less detail = lower computational load)
  • Provide resources in the applet for user exploration (not just a reference tool)
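
The "multiple levels of detail" idea can be sketched as two queries over the same SNP list: a coarse density summary and a detailed lookup restricted to a small region. The (chromosome, position) tuple format, the function names, and the window size are hypothetical and do not describe the applet's actual data structures.

```python
from collections import Counter


def snp_density(snps, window=100_000):
    """Low-detail view: count SNPs per (chromosome, window start) bin."""
    counts = Counter()
    for chrom, pos in snps:
        counts[(chrom, (pos // window) * window)] += 1
    return counts


def snps_in_region(snps, chrom, start, end):
    """High-detail view: individual SNPs in [start, end) on one chromosome."""
    return [(c, p) for c, p in snps if c == chrom and start <= p < end]


# Toy records in a made-up (chromosome, position) format.
snps = [("Chr01", 1_200), ("Chr01", 1_800), ("Chr01", 250_500), ("Chr02", 999)]
density = snp_density(snps, window=100_000)
detail = snps_in_region(snps, "Chr01", 0, 2_000)
```

The density summary is small enough to show for a whole chromosome, while the detailed lookup is only run after the user has zoomed to a region of interest.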

Applet Design

SNP Population Distribution

SNP Applet Overview

Density of SNPs: Chromosome Level

SNP Density

Individual SNPs: Comparing Varieties

Variety-Level SNP Browser

Genealogy and Phenotypes

SNP Linked Plots

Interactive Plot Design

Good Statistical Graphics

Function:

  • Show the data
  • Don’t distort the data

Form:

  • Show a consistent story
  • Provide several levels of detail (ideally)

Elegance:
How do I best communicate the data?

  • Perceptual Awareness
  • Visual Bandwidth (information overload)

Conclusions

Next Steps

  • User Studies of Interactive Graphics
    • Eye Tracking
    • Click Recording
    • Content Questions
    • “At what point do humans get overloaded?”

  • Color Perception for Statistical Plots
    • ColorBrewer palettes for maps
    • dichromat R package to simulate colorblindness
    • Need for validated color schemes that work well for scatterplots, bar charts, and other statistical plots

  • Hierarchy of Visual Features

Goal: Understand which features are most visually important.

Hierarchy of Graphical Features; No color

Hierarchy of Graphical Features; With color

Other Projects

  • Animint - Extends the ggplot2 implementation of the grammar of graphics to interactive plots

  • USDA Soybean Population Genetics Research
    • Analysis of copy-number variants
    • Genome-wide association studies of identified SNPs
    • Genealogy database

  • Data Aggregation
    • Craigslist ads
    • OkCupid
    • Location-based energy prices

Summary

  • Visualization research is inherently interdisciplinary
  • Graphics are evolving with new data
  • Need to quantify perception to better evaluate graphs

Questions?

Acknowledgements

Computation

  • dplyr/plyr
  • reshape2/tidyr
  • cn.mops: CNV identification in populations of genetic data

Visualization Software

  • ggplot2
  • Animint
    d3 interactive web graphics using ggplot2 syntax in R
  • Shiny (RStudio): interactive web applets
  • reveal.js (slides) with R Markdown and knitr

People

  • Heike Hofmann
  • Di Cook
  • Michelle Graham
  • Lindsay Rutter

Other Research

Visual Reasoning

  • Graphics research often uses the lineup protocol, a hypothesis test analogue for static graphics.

  • Goal: Understand correlation between graphical perception, lineup performance, mathematical reasoning, and classification skill.
Statistical Lineups
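
A minimal sketch of the lineup idea: the plot of the real data is hidden among m - 1 null plots, and the null hypothesis is rejected if observers pick it out more often than chance. The permutation null (shuffling y to break its association with x) and the simple binomial p-value are assumptions chosen for illustration; the function names are hypothetical, and the p-value treats observers as independent with a 1/m chance of picking the data plot under the null.

```python
import math

import numpy as np


def make_lineup(x, y, m=20, seed=None):
    """Return m panels of (x, y) data and the index of the real-data panel."""
    rng = np.random.default_rng(seed)
    true_pos = int(rng.integers(m))
    panels = []
    for i in range(m):
        if i == true_pos:
            panels.append((x, y))                   # the observed data
        else:
            panels.append((x, rng.permutation(y)))  # a permutation null plot
    return panels, true_pos


def lineup_p_value(k_correct, n_observers, m=20):
    """P(at least k_correct of n_observers pick the data plot by chance)."""
    p = 1.0 / m
    return sum(
        math.comb(n_observers, k) * p ** k * (1.0 - p) ** (n_observers - k)
        for k in range(k_correct, n_observers + 1)
    )


x = np.arange(30, dtype=float)
y = 2.0 * x + 1.0
panels, true_pos = make_lineup(x, y, m=20, seed=1)
p_single = lineup_p_value(1, 1, m=20)  # one observer, one correct pick
```

For a single observer and a 20-panel lineup, a correct identification corresponds to a p-value of 1/20 under this simplified model.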

Visual Reasoning

Conclusion: Lineups are an inductive classification task using graphics; performance is not seriously impacted by spatial ability (outside of general aptitude).

Figure Classification Task

Graphical Features

Goal: Understand which features are most visually important.

Hierarchy of Graphical Features; No color

Hierarchy of Graphical Features; With color